Auditory models in isolated word recognition

Authors

  • Mats Blomberg
  • Rolf Carlson
  • Kjell Elenius
  • Björn Granström
Abstract

A straightforward isolated word recognition system has been used to test different auditory models in acoustic front-end processing. The models include BARK, PHON, and SONE. The PHONTEMP model is based on PHON but also includes temporal forward masking. We also introduce a model, DOMIN, which is intended to measure the dominating frequency at each point along the 'basilar membrane.' All the above models were derived from an FFT analysis, and the FFT processing is also used as a reference model. One male and one female speaker were used to test the recognition performance of the different models on a difficult vocabulary consisting of 18 Swedish consonants and 9 Swedish vowels. The results indicate that the performance of the models decreases as they become more complex. The overall recognition accuracy of FFT is 97%, while it is 87% for SONE. However, the DOMIN model, which is sensitive to dominant frequencies (formants), performs very well for vowels. Three different metrics for measuring the distance between speech frames have been tested: city-block, Euclidean, and squared (Euclidean without taking the square root). The Euclidean metric seems to give slightly better performance. Reducing the number of channels in the FFT processing clearly shows that performance increases with the number of channels.

(This is an expanded version of a paper presented at the Symposium on 'Invariance and Variability of Speech Processes', MIT, Oct. 8-10, 1983.)

Introduction

The use of auditory models as speech recognition front ends has recently attracted a great deal of interest. The underlying assumption is that a good model of the auditory system should generate a more natural and efficient representation of speech compared to ordinary spectrum analysis. However, we have to keep in mind that only some of the peripheral processes of sound perception are included in most existing models. In this paper we will discuss some standard spectral transformations of the acoustic information and also a model based on the dominant frequency concept. We will also evaluate the performance of the different representations in the context of a standard speech recognition system.
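Since that evaluation rests on frame-by-frame spectral distances, the sketch below makes the three metrics mentioned in the abstract concrete. This is our own illustrative code, not the authors' implementation; the 24-channel frame size and NumPy usage are assumptions for the example only.

```python
# Minimal sketch (not the authors' code): the three frame distances
# compared in the experiments, computed between two spectral frames
# represented as vectors of channel amplitudes.
import numpy as np

def city_block(a, b):
    """Sum of absolute channel differences (L1 norm)."""
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    """Square root of the summed squared channel differences (L2 norm)."""
    return np.sqrt(np.sum((a - b) ** 2))

def squared(a, b):
    """Euclidean without the square root; when accumulated over the frames
    of an utterance it weights large per-frame deviations more heavily."""
    return np.sum((a - b) ** 2)

# Example with two random 24-channel frames (channel count is illustrative).
rng = np.random.default_rng(0)
frame_a = rng.random(24)
frame_b = rng.random(24)
print(city_block(frame_a, frame_b), euclidean(frame_a, frame_b), squared(frame_a, frame_b))
```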
Auditory models

Basic research has resulted in several models of peripheral auditory processing. The elaboration by Zwicker and Feldtkeller (1967) of the loudness and Bark concepts has become more or less standard in psychoacoustics. Other models, including lateral inhibition and time-dependent mechanisms, have been created elsewhere. The development of methods for similarity rating of speech spectra has been of interest in many research groups: Plomp (1970), Pols (1970), Bladon and Lindblom (1977), and Carlson and Granström (1979). At the same time, efforts have been made to include knowledge of the auditory system in practical applications: Schroeder et al. (1979) and Lyon (1982). Klatt (1979, 1982a, 1982b) has discussed physiologically related spectral representations in models for lexical access and speech recognition systems. New models of the peripheral auditory system are being developed based on neurophysiological results: Chistovich et al. (1979, 1982), Sachs and Young (1980), Sachs et al. (1982), Delgutte (1980, 1982), Dolmazon (1982), Goldhor (1983), and Seneff (1983). This positive development makes us believe that in the future we could use these kinds of models as the first analyzing steps in a speech recognition system.

In the present paper we try to elaborate some of the basic facts of the auditory mechanisms in the context of such a system. In Fig. 1 we present the different models/transformations that we have used in the current experiment. A pure sinusoid and a vowel are used as test stimuli to illustrate some alternative representations in the amplitude/frequency domain. Fig. 2 gives examples of computer-generated spectrograms based on some of these models.

The speech signal is first low-pass filtered at 6.3 kHz and digitized at 16 kHz. An FFT spectrum is calculated every 20 ms using a 25 ms Hamming window. The line spectrum is then transformed in the frequency domain by adding energies to get 300 Hz wide channels. This reduces the influence of the fundamental frequency on the spectrum and is done in 74 overlapping channels from 0 to 7.7 kHz using a linear frequency scale. The result of this processing is seen in Figs. 1a and 2a (FFT).

If we use a Bark scale and a bandwidth of one Bark, we get a psychoacoustically more relevant representation (BARK, Fig. 1b). The Hamming window is set to 10 ms with a sampling frequency of 16 kHz to facilitate a fast response for frequencies higher than 2 kHz, and to 20 ms with a sampling frequency of 4 kHz for frequencies lower than 2 kHz. The reduction to a lower sampling frequency gives a better frequency resolution for the subsequent transformation into the Bark scale. This transformation is used in this and all the following models, while the summation into one-Bark bands is only done in this model.

A psychoacoustic masking filter (Schroeder et al., 1979), rather than a sharp bandpass filter, together with equal-loudness curves (phon curves), has been used to derive a phon/Bark plot (PHON, Figs. 1c, 2b, and 4a). We argue that the visual impression of Fig. 2b has a much closer relation to the perceived sound than the FFT representation. Note the reduced emphasis on the fricative and the position of the very important second formant in the middle of the spectrogram. The perceptually prominent lowest formant is also visually enhanced. The phon/Bark representation has been transformed to a sone/Bark representation, which is often claimed to give a better description of the perceived loudness (SONE, Fig. 1d).
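As a rough illustration of the FFT reference front end just described, the sketch below computes spectra with a 25 ms Hamming window every 20 ms and sums line-spectrum energies into 300 Hz channels, and adds one common Hz-to-Bark approximation (Zwicker & Terhardt, 1980). It is our own simplification, not the authors' code: the paper does not state which Bark formula it uses, the channels here are non-overlapping rather than overlapping, and the channel count is illustrative.

```python
# Illustrative sketch of an FFT front end of the kind described above.
import numpy as np

FS = 16000               # sampling frequency (Hz)
WIN = int(0.025 * FS)    # 25 ms Hamming window
HOP = int(0.020 * FS)    # one frame every 20 ms
CH_BW = 300.0            # channel bandwidth (Hz)

def fft_channels(signal, n_channels=24):
    """Return a (frames x channels) array of energies in 300 Hz bands."""
    window = np.hamming(WIN)
    freqs = np.fft.rfftfreq(WIN, d=1.0 / FS)
    frames = []
    for start in range(0, len(signal) - WIN + 1, HOP):
        spectrum = np.abs(np.fft.rfft(signal[start:start + WIN] * window)) ** 2
        # Sum line-spectrum energies into consecutive 300 Hz wide channels
        # to reduce the influence of the fundamental frequency.
        # (The paper uses overlapping channels; these are non-overlapping.)
        chans = [spectrum[(freqs >= k * CH_BW) & (freqs < (k + 1) * CH_BW)].sum()
                 for k in range(n_channels)]
        frames.append(chans)
    return np.array(frames)

def hz_to_bark(f):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Example: one second of noise through the front end.
x = np.random.default_rng(0).standard_normal(FS)
print(fft_channels(x).shape)   # (number of frames, number of channels)
print(hz_to_bark(1000.0))      # roughly 8.5 Bark
```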



Publication date: 1984